Abstract 

Background: High-throughput sequencing can identify numerous potential genomic targets for microbial strain 
typing, but identification of the most informative combinations requires the use of computational screening tools. 
This paper describes novel software - Automated Selection of Typing Target Subsets (AuSeTTS) - that allows 
intelligent selection of optimal targets for pathogen strain typing. The objective of this software is to maximise 
both discriminatory power, using Simpson's index of diversity (D), and concordance with existing typing methods, 
using the adjusted Wallace coefficient (AW). The program interrogates molecular typing results for panels of 
isolates, based on large target sets, and iteratively examines each target, one-by-one, to determine the most 
informative subset. 

Results: AuSeTTS was evaluated using three target sets: 51 binary targets (13 toxin genes, 16 phage-related loci 
and 22 SCCmec elements), used for multilocus typing of 153 methicillin-resistant Staphylococcus aureus (MRSA) 
isolates; 17 MLVA loci in 502 Streptococcus pneumoniae isolates from the MLVA database (www.mlva.eu) and 12 
MLST loci for 98 Cryptococcus spp. isolates. 

The maximum D for MRSA, 0.984, was achieved with a subset of 20 targets, and a D value of 0.954 with 7 targets. 
Twelve targets predicted MLST with a maximum AW of 0.9994. All 17 S. pneumoniae MLVA targets were required to 
achieve the maximum D of 0.997, but 4 targets reached a D of 0.990. Twelve targets predicted pneumococcal serotype 
with a maximum AW of 0.899 and 9 predicted MLST with a maximum AW of 0.963. Eight of the 12 MLST loci were 
sufficient to achieve the maximum D of 0.963 for Cryptococcus spp. 

Conclusions: Computerised analysis with AuSeTTS allows rapid selection of the most discriminatory targets for 
incorporation into typing schemes. Output of the program is presented in both tabular and graphical formats and 
the software is available for free download from http://www.cidmpublichealth.org/pages/ausetts.html. 

Keywords: Comparative genomics, Multilocus sequence typing, MLVA, Binary typing, Software, Microbial typing, 
MRSA, Cryptococcus, Staphylococcus aureus, Streptococcus pneumoniae 



Background 

Microbial strain typing schemes, with variable discriminatory 
powers, are increasingly applied to study 
long-term evolution, detect emergence of new or hypervirulent 
clones, identify outbreaks and track transmission 
chains. New high-throughput DNA sequencing methods 
identify hitherto unrecognised variation in the 
genomes of even closely related isolates, which is a valuable 
source of targets for use in new microbial typing 
schemes. These genotyping systems can be tailored to 
have discriminatory power appropriate for the purpose 
[1] but systematic assessment of the characteristics of 
potential targets is required to ensure the quality and 
reliability of the resulting typing scheme. 

Existing typing systems involve interrogation of several 
genetic loci to determine sequence variation (e.g. multi- 
locus sequence typing, MLST), length polymorphisms 
(e.g. multi-locus variable number of tandem repeats 
analysis, MLVA) or the presence or absence of genetic 
targets (i.e. binary typing). Next generation sequencing 
technologies have yielded vast amounts of sequencing 
information for a wide variety of organisms, and benchtop 
sequencers permit real-time subtyping of bacteria 
by sequencing small batches of bacteria in a matter of 
hours [2]. This has prompted some to advocate whole 
genome sequencing as a routine typing method [3], but 
limitations in data analysis and in assigning cut-offs for 
relatedness mean that whole genome data are more commonly 
used to identify loci that may be useful in designing 
informative typing systems [4]. A critical step in deciding 
which loci to incorporate into such typing systems is to 
estimate the discriminatory power and concordance with 
other typing systems that would be achieved with different 
combinations of loci. 

© 2013 O'Sullivan et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 

O'Sullivan et al. BMC Bioinformatics 2013, 14:148 
http://www.biomedcentral.com/1471-2105/14/148 

The essential characteristics of a microbial typing system 
include appropriate discriminatory power for the 
research question being studied, consistency with both 
clinical epidemiology and established typing methods, 
stability, reproducibility, typeability, ease of use and 
interpretation, high throughput and low cost [5]. 

Discriminatory power is most frequently assessed 
using Simpson's index of diversity (D), which gives the 
probability that isolates randomly selected from a popu- 
lation would differ using the typing method. 

A number of indices can likewise be used to measure 
concordance between typing systems or between a typing 
system and epidemiologic classifications. The Wallace 
coefficient (W) estimates the probability that two isolates 
assigned the same type by the method under evaluation 
(M1) belong to the same type using the comparator 
method (M2). W is a directional measure; that is, the 
results for the concordance of M1 with M2 are different 
from those of the concordance of M2 with M1. 

When choosing targets identified by comparative ge- 
nomics for incorporation into a new typing system, a 
good starting point is to select those that in combination 
give the most favourable results for these measures 
of discriminatory power and/or concordance using an 
existing collection of typed isolates. However, examin- 
ation of every possible combination of candidate targets, 
individually, is often computationally expensive. For example, 
comparison of all possible subsets of 100 potential 
targets available for use in a typing system, to 
determine the most informative subset, would require 
approximately 10^30 calculations (2^100 subsets), which is 
beyond the capacity of standard computers. Therefore, alternative approaches are 
required. Software has been developed to interrogate 
informative single nucleotide polymorphisms (SNPs) in 
sequence based data (Minimum SNPs) but it is not 
designed to handle other forms of typing data [6,7]. Furthermore, 
while it can be used to identify the SNPs that 
are most predictive of a user-nominated sequence type, 
it does not consider overall measures of concordance between 
typing systems. We report here a new computational 
approach for selecting the most informative sets of 
genomic loci for multi-target microbial typing and discuss 
its application to different typing methods for 
pathogenic bacteria and fungi. 



Implementation 

In constructing an approach for interrogating combi- 
nations of targets, which are either binary and/or mul- 
tistate (where a target can assume any of >2 possible 
values), we developed a heuristic based on the stepwise 
accumulation of informative targets. Here 'informative' 
means the combination of targets producing either the 
greatest discriminatory power or the greatest concord- 
ance with existing typing methods (as selected by the 
user). This heuristic assumes that the most informative 
combination of n + 1 targets includes the most inform- 
ative combination of n targets as a subset. While this as- 
sumption may not always hold true, it vastly reduces the 
number of combinations that need to be examined to 
determine the maximally informative subset of targets 
and it can be confirmed post-hoc for a given dataset. 
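
As a rough illustration of the saving, the following Python sketch compares the number of non-empty subsets of 100 targets with the number of combinations a stepwise search scores when no ties occur. The function names are hypothetical and are not part of AuSeTTS; the no-ties count is an assumption, since tied combinations are carried forward and add further evaluations.

```python
def exhaustive_subsets(n_targets: int) -> int:
    """Number of non-empty subsets of n_targets candidate targets."""
    return 2 ** n_targets - 1


def stepwise_evaluations(n_targets: int) -> int:
    """Combinations scored by a stepwise search, assuming no ties.

    Step k extends the single best subset of size k - 1 with each of the
    n - k + 1 remaining targets, so the total is n + (n - 1) + ... + 1.
    """
    return n_targets * (n_targets + 1) // 2


print(f"{exhaustive_subsets(100):.2e}")  # 1.27e+30
print(stepwise_evaluations(100))         # 5050
```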

AuSeTTS (Automated Selection of Typing Target Sub- 
sets) is a software program designed to analyse a large array 
of typing data for a panel of isolates and determine the opti- 
mal combination of typing targets to maximise discrimin- 
atory power and/or concordance measures for a specified 
subset size. The analysis can be performed with (heuristic 
search) or without (exhaustive search) the heuristic de- 
scribed above. The software was written in Microsoft Visual 
Basic for Excel (2010); it is available for free download from 
http://www.cidmpublichealth.org/pages/ausetts.html and 
also accompanies this paper (Additional file 1). 

The input data consist of a table of typing results with 
the targets in columns and the isolates in rows. Each cell 
represents the result for a given target in a given isolate 
and is expressed as character-based data (for example 0 
or 1 for binary data, allele numbers for MLST or num- 
bers of repeats for MLVA data). One or more columns 
can be specified as the comparator typing method for 
calculating measures of concordance and typing results 
can be represented in the dataset multiple times by pro- 
viding numbers of isolates for each row in a specified 
column. Non-informative targets (i.e. which have the same 
result for every isolate or are completely concordant with 
a second target) are automatically removed from the set 
before analysis. 
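
The input layout and the pre-filtering step can be sketched in Python as follows. This is a minimal illustration and not the AuSeTTS code: the toy dataset and function names are hypothetical, "completely concordant" is assumed to mean that two targets split the isolates into identical groups, and the optional isolate-count column is not handled.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy dataset: one row per isolate, one column per target.
ISOLATES: List[Dict[str, str]] = [
    {"t1": "1", "t2": "0", "t3": "1", "t4": "a"},
    {"t1": "1", "t2": "1", "t3": "1", "t4": "b"},
    {"t1": "1", "t2": "0", "t3": "0", "t4": "a"},
    {"t1": "1", "t2": "1", "t3": "0", "t4": "b"},
]


def partition_signature(values: Sequence[str]) -> Tuple[int, ...]:
    """Relabel values by order of first appearance, so that two targets that
    group the isolates identically get the same signature."""
    first_seen: Dict[str, int] = {}
    return tuple(first_seen.setdefault(v, len(first_seen)) for v in values)


def informative_targets(rows: List[Dict[str, str]], targets: Sequence[str]) -> List[str]:
    """Drop targets that are constant or that duplicate an earlier target's grouping."""
    kept: List[str] = []
    seen = set()
    for target in targets:
        signature = partition_signature([row[target] for row in rows])
        if len(set(signature)) < 2:  # same result for every isolate
            continue
        if signature in seen:        # same grouping as a target already kept
            continue
        seen.add(signature)
        kept.append(target)
    return kept


print(informative_targets(ISOLATES, ["t1", "t2", "t3", "t4"]))  # ['t2', 't3']
```

Here t1 is dropped because it is constant, and t4 is dropped because it separates the isolates exactly as t2 does.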

Using the heuristic search, the software initially ranks 
each target by its individual discriminatory power or 
concordance. It then examines all other targets in combination 
with the most informative target(s) to identify 
the most informative combinations of two targets. Further 
targets are then added iteratively until the whole 
dataset has been examined. When a 'tie' between combinations 
is encountered, each of the tied combinations 
continues to be considered, with additional targets being 
added until the ties are broken. Once the ties are broken, 
the less informative combination(s) are abandoned. 
A 'threshold' is ultimately determined: the number of 
targets beyond which adding more targets does not 
further increase discriminatory power or concordance. 
Figure 1 presents a schematic overview of the program. 
The output is a list of targets for each subset size 
that maximise discriminatory power or concordance, 
with the results of these measures and 95% confidence 
intervals. The information is also presented graphically 
(Figure 2). 
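
The stepwise procedure described above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the Visual Basic code shipped with AuSeTTS: Simpson's D is used as the only scoring measure, ties are handled by carrying every tied best subset forward to the next size, and the dataset and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, str]

# Hypothetical toy dataset; target "c" duplicates target "a".
ROWS: List[Row] = [
    {"a": "0", "b": "0", "c": "0"},
    {"a": "0", "b": "1", "c": "0"},
    {"a": "1", "b": "0", "c": "1"},
    {"a": "1", "b": "1", "c": "1"},
]


def simpsons_d(rows: Sequence[Row], subset: Iterable[str]) -> float:
    """Simpson's index of diversity for the types defined by a subset of targets."""
    cols = sorted(subset)
    n = len(rows)
    counts = Counter(tuple(row[t] for t in cols) for row in rows)
    return 1.0 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def heuristic_search(rows: Sequence[Row], targets: Sequence[str]) -> List[Tuple[float, List[List[str]]]]:
    """For each subset size, return the best score and every tied best subset."""
    results = []
    frontier = {frozenset()}  # best subsets found at the previous size
    for _ in targets:
        scored = {}
        for base in frontier:
            for target in targets:
                if target not in base:
                    candidate = base | {target}
                    if candidate not in scored:
                        scored[candidate] = simpsons_d(rows, candidate)
        best = max(scored.values())
        # Keep all tied subsets; less informative ones are abandoned.
        frontier = {s for s, score in scored.items() if score == best}
        results.append((best, sorted(sorted(s) for s in frontier)))
    return results


def threshold(results: List[Tuple[float, List[List[str]]]]) -> int:
    """Smallest subset size at which the maximum score is first reached."""
    best_overall = max(score for score, _ in results)
    return next(size for size, (score, _) in enumerate(results, 1) if score == best_overall)


results = heuristic_search(ROWS, ["a", "b", "c"])
for size, (score, subsets) in enumerate(results, 1):
    print(size, round(score, 3), subsets)
print("threshold:", threshold(results))
```

In this toy case all three single targets tie, so all are carried forward; at size two the tie is broken in favour of the pairs that include "b", and the threshold is two targets.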

Using an exhaustive search, the user specifies the 
number of targets to be included (the subset size). 
The software then examines every possible combin- 
ation of targets producing a subset of this size and 
calculates the discriminatory power (and, if specified, 
the concordance measures). The combinations with 
the highest achievable discriminatory power are 
returned, along with 95% confidence intervals. The 
exhaustive search gives a definitive result that is not 
dependent on the heuristic. It may not be feasible to 
examine very large datasets with an exhaustive search: 
on testing, examining subsets of 5 binary targets 
from a dataset of 20 targets for 100 isolates (15,504 
possible combinations) took 20 seconds, while doubling 
the subset size to 10 targets from the same 
dataset increased the number of combinations to be 
examined by more than 10-fold, which led to a corresponding 
increase in the computing time. Thus the 
number of combinations grows combinatorially, the 
exhaustive search becomes computationally intractable 
for very large datasets, and the heuristic approach 
becomes necessary. 
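
A minimal Python sketch of an exhaustive search for a fixed subset size is given below, together with the combination counts quoted above. It is an illustration only: the dataset and function names are hypothetical, Simpson's D is assumed as the scoring measure, and timings will differ from those reported for the Excel implementation.

```python
from collections import Counter
from itertools import combinations
from math import comb
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]

# Hypothetical toy dataset; target "c" duplicates target "a".
ROWS: List[Row] = [
    {"a": "0", "b": "0", "c": "0"},
    {"a": "0", "b": "1", "c": "0"},
    {"a": "1", "b": "0", "c": "1"},
    {"a": "1", "b": "1", "c": "1"},
]


def simpsons_d(rows: Sequence[Row], subset: Sequence[str]) -> float:
    """Simpson's index of diversity for the types defined by a subset of targets."""
    n = len(rows)
    counts = Counter(tuple(row[t] for t in subset) for row in rows)
    return 1.0 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def exhaustive_search(rows: Sequence[Row], targets: Sequence[str], size: int) -> Tuple[float, List[Tuple[str, ...]]]:
    """Score every subset of the given size and return the best score with all ties."""
    best, winners = -1.0, []
    for subset in combinations(targets, size):
        score = simpsons_d(rows, subset)
        if score > best:
            best, winners = score, [subset]
        elif score == best:
            winners.append(subset)
    return best, winners


print(exhaustive_search(ROWS, ["a", "b", "c"], 2))
print(comb(20, 5))                 # 15504 subsets of 5 targets from 20
print(comb(20, 10))                # 184756 subsets of 10 targets from 20
print(comb(20, 10) / comb(20, 5))  # ratio is just under 12
```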

Formulas 

The formula used for calculating D was as follows: 

$$D = 1 - \frac{1}{N(N-1)} \sum_{j=1}^{S} n_j (n_j - 1)$$

Where $N$ is the number of isolates in the sample 
population, $S$ is the number of distinct types identified 
in the population and $n_j$ is the number of isolates 
of type $j$ [8]. The following formulas have been 
developed for calculating confidence intervals for D 
[9,10]: 

$$\sigma^2 = \frac{4}{N} \left[ \sum_{j=1}^{S} \pi_j^3 - \left( \sum_{j=1}^{S} \pi_j^2 \right)^2 \right]$$

$$CI = \left[ D - 2\sqrt{\sigma^2},\; D + 2\sqrt{\sigma^2} \right]$$

Where $\pi_j = n_j / N$ is the frequency of type $j$, $\sigma^2$ is the variance and CI is the approximate 
95% confidence interval. The formula used for variance 
is a large sample approximation; a non-approximated 
formula for variance has also been described [10]. 
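
As a worked illustration, the Python sketch below computes D and its approximate 95% confidence interval using the large-sample variance. The function name and example type list are hypothetical; the variance is assumed to be computed from the type frequencies n_j/N, a tiny negative variance caused by floating-point rounding is clamped to zero, and the interval is not clipped to the range 0 to 1.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence, Tuple


def simpsons_d_with_ci(types: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Return (D, lower, upper) for a list holding one type label per isolate."""
    n = len(types)
    counts = list(Counter(types).values())
    d = 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))

    # Large-sample variance based on the type frequencies.
    freqs = [c / n for c in counts]
    variance = (4.0 / n) * (sum(p ** 3 for p in freqs) - sum(p ** 2 for p in freqs) ** 2)
    variance = max(variance, 0.0)  # guard against rounding below zero

    half_width = 2.0 * sqrt(variance)
    return d, d - half_width, d + half_width


# Hypothetical example: 10 isolates falling into three types.
TYPES = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
d, lower, upper = simpsons_d_with_ci(TYPES)
print(f"D = {d:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

With so few isolates the interval is wide, which is consistent with the formula being a large-sample approximation.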

To calculate W, the typing results for both methods 
for each isolate in the data set must be examined against 
those for every other isolate in the data set to see if they 
match or are discordant. The formula used for W is 
given by [11]: 

$$W_{(M_1,M_2)} = \frac{a}{a + b}$$

Where $a$ is the number of instances where two isolates 
of the same type by method $M_1$ are of the same type by 
method $M_2$, while $b$ is the number of instances where 
two isolates of the same type by method $M_1$ are of a 
different type by method $M_2$. The Adjusted Wallace 
coefficient (AW) incorporates an adjustment to account 
for concordance that may occur by chance alone. The 
formula for AW is given by [12]: 



$$AW_{(M_1,M_2)} = \frac{W_{(M_1,M_2)} - 1 + D_{(M_2)}}{D_{(M_2)}}$$

Where $D_{(M_2)}$ is the Simpson's index of diversity of the 
dataset using typing method $M_2$. In addition, the Rand 
(R), adjusted Rand (AR) and the approximate 95% confidence 
intervals of AW are also calculated [12,13]. The 
analytical confidence interval calculations for W may not 
be valid for values of <0.5. An alternative method for 
calculation of confidence intervals for these measures of 
congruence is to use Jackknife resampling [14], for 
which an online tool is available [15]. 
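
The pairwise counting behind W and the chance adjustment for AW can be sketched in Python as follows. The function names and the example type assignments are hypothetical, and the sketch covers the point estimates only: it does not compute R, AR or any confidence intervals, and it raises an error if no two isolates share a type by the first method or if the comparator method has zero diversity.

```python
from collections import Counter
from itertools import combinations
from typing import Hashable, Sequence


def simpsons_d(types: Sequence[Hashable]) -> float:
    """Simpson's index of diversity for one typing method."""
    n = len(types)
    return 1.0 - sum(c * (c - 1) for c in Counter(types).values()) / (n * (n - 1))


def wallace(m1: Sequence[Hashable], m2: Sequence[Hashable]) -> float:
    """W(M1, M2): of the isolate pairs matching by M1, the fraction matching by M2."""
    a = b = 0
    for i, j in combinations(range(len(m1)), 2):
        if m1[i] == m1[j]:
            if m2[i] == m2[j]:
                a += 1  # same type by both methods
            else:
                b += 1  # same by M1, different by M2
    if a + b == 0:
        raise ValueError("no pair of isolates shares a type by the first method")
    return a / (a + b)


def adjusted_wallace(m1: Sequence[Hashable], m2: Sequence[Hashable]) -> float:
    """AW(M1, M2) = (W - 1 + D(M2)) / D(M2)."""
    d2 = simpsons_d(m2)
    if d2 == 0:
        raise ValueError("comparator method has no diversity")
    return (wallace(m1, m2) - 1.0 + d2) / d2


# Hypothetical type assignments for six isolates by two methods.
M1 = [1, 1, 1, 1, 2, 2]
M2 = ["x", "x", "y", "y", "z", "z"]
print(wallace(M1, M2), wallace(M2, M1))
print(adjusted_wallace(M1, M2), adjusted_wallace(M2, M1))
```

The two directions give different values here because M2 subdivides an M1 type: every pair matching by M2 also matches by M1, but not the reverse.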

Confidence intervals are provided for the purposes of 
comparison of results with other typing methods. How- 
ever, in the algorithm, only the point estimates of D, 
AW, or AR, without confidence intervals, were used to 
determine the most informative values of each combin- 
ation of targets. This approach reduces the complexity of 
the heuristic and, hence, the computation time required 
but the results relate only to the input dataset. The 
optimal combination of targets may therefore be different 
for larger sample sizes or samples from different popula- 
tions of the same microbial species. 

Results and discussion 

Validation 

To examine the robustness of the assumption that tar- 
gets may be added in a stepwise fashion while maximi- 
sing the parameter of interest (heuristic search), random 
datasets were generated and tested using both search 
types. These random datasets were defined by varying a) 
the number of targets, b) the number of different states 
each target could assume, c) the number of strain types 
and d) the number of isolates distributed (unevenly) 
amongst the strain types. 

Figure 1 Schematic overview of iterative assessment of typing targets conducted by AuSeTTS (heuristic search). 

Figure 2 AuSeTTS graphical output. (A) Relationship between the number of target loci and the discriminatory power of molecular subtyping. 
Results for analysis of MRSA binary typing data. The maximum Simpson's Index of Diversity was achieved with a combination of 20 targets. 
(B) MRSA binary typing data analysis to maximise the Wallace coefficient. Maximum concordance of binary type to predict MLST was achieved 
with 12 binary targets, with an AW value of 0.994. 

For each dataset, a heuristic search was used to calculate 
the threshold subset size. The heuristic search result 
for a subset of one target less than the threshold was 
compared with an exhaustive search result specifying the 
same sized subset. If the resulting maximum parameter 
value using the exhaustive search was the same as that 
of the heuristic search, the heuristic was considered to 
be valid. If the maximal parameter value achieved by the 
heuristic search was less than that using the exhaustive 
search, the heuristic was considered not to have held. 
A total of 25,600 randomly generated datasets were examined for 
each of the 5 parameters of interest. The heuristic was 
valid in 79.4% (95% confidence interval 79-80), 98.2% 
(98-99), 83.4% (83-84), 92.9% (92-93) and 93.6% 
(93-94) of random datasets for D, AW(A→B), AW(B→A), 
R and AR, respectively. 
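
The comparison described above can be sketched in Python for the D parameter. This is a simplified illustration, not the validation code used in the study: every cell is drawn uniformly at random rather than being distributed unevenly across predefined strain types, only D is examined, and all names and dataset sizes are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple

Row = Tuple[int, ...]


def simpsons_d(rows: Sequence[Row], cols: Sequence[int]) -> float:
    """Simpson's index of diversity using only the given target columns."""
    n = len(rows)
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return 1.0 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def heuristic_best(rows: Sequence[Row], n_targets: int) -> List[float]:
    """Best D per subset size from a stepwise search, carrying ties forward."""
    frontier, best_by_size = {frozenset()}, []
    for _ in range(n_targets):
        scored = {}
        for base in frontier:
            for t in range(n_targets):
                if t not in base:
                    cand = base | {t}
                    if cand not in scored:
                        scored[cand] = simpsons_d(rows, sorted(cand))
        best = max(scored.values())
        frontier = {s for s, v in scored.items() if v == best}
        best_by_size.append(best)
    return best_by_size


def heuristic_held(rows: Sequence[Row], n_targets: int) -> bool:
    """Compare heuristic and exhaustive results one target below the threshold."""
    best_by_size = heuristic_best(rows, n_targets)
    threshold = best_by_size.index(max(best_by_size)) + 1
    size = threshold - 1
    if size < 1:
        return True  # threshold of one target: no smaller subset to compare
    exhaustive = max(simpsons_d(rows, c) for c in combinations(range(n_targets), size))
    return exhaustive == best_by_size[size - 1]


def random_dataset(n_isolates: int, n_targets: int, n_states: int, rng: random.Random) -> List[Row]:
    """Simplified generator: each cell is an independent uniform draw."""
    return [tuple(rng.randrange(n_states) for _ in range(n_targets)) for _ in range(n_isolates)]


rng = random.Random(0)
trials = [heuristic_held(random_dataset(30, 6, 2, rng), 6) for _ in range(50)]
print(f"heuristic held in {sum(trials)} of {len(trials)} random datasets")
```

The proportion printed depends on the generator and seed, so it should not be expected to reproduce the percentages reported above.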

Factors associated with failure of the heuristic to iden- 
tify the combination of targets that maximised D in- 
cluded: a value of D between 0.90 and 0.96, and a larger 
number of targets analysed. It performed best when the 
maximum D of the whole dataset was 1 (87.8%; 95% CI 
87-89). The number of strain types, the number of isolates 
in the dataset and the number of states each target 
could assume did not influence the likelihood of the 
heuristic being valid. 

The heuristic performed well for all four concordance 
measures. Factors associated with a lower likelihood of 
the heuristic being valid for concordance measures 
included an increasing number of targets in the 
dataset, D value of the dataset between 0.9 and 0.96, 
examination of a subset of close to half of the total 
number of targets and, for AW(A→B), a maximum AW 
value between 0.1 and 0.35. 

Full details of the validation are available in the sup- 
plementary material (Additional file 2). 
Application 

The software was used to analyse different forms of mi- 
crobial typing data generated by well-validated methods, 
specifically, binary typing data for Staphylococcus aureus 
[16-18], MLVA for Streptococcus pneumoniae [19] and 
MLST for Cryptococcus spp. [20,21]. 

Selection of targets for Staphylococcus aureus strain 
typing 

Typing results for 51 binary targets in 153 methicillin- 
resistant S. aureus (MRSA) isolates (42 well characte- 
rised reference isolates and 111 clinical isolates from our 
institution) were available from previous experiments in 
our laboratory [16-18]. The targets comprised: 13 toxin 
genes [17], 16 phage-derived open reading frames [18] 
and 22 SCCmec elements [16], which had been interrogated 
using multiplex-PCR reverse line blot assays 
[22,23]. 

The maximum D value of binary typing with all 51 
targets for this collection of MRSA isolates was 0.984 
(95% confidence interval 0.975-0.992). AuSeTTS heuris- 
tic search showed that this could be achieved with a sub- 
set of 20 binary targets, while a subset of just 7 targets 
achieved a D value of 0.954 (0.941-0.967) (Figure 2A). 
When used to predict MLST (which had been de- 
termined by either the conventional [24] or SNP-based 
[25] methods for all 153 isolates), a maximum Adjusted 
Wallace coefficient of concordance (AW) of 0.9994 
(0.999-1.000) was achieved with 12 targets (Figure 2B). 
One binary type consisted of two isolates with different 
MLST (which were single-locus variants). Isolates within 
each of the remaining binary types all belonged to one 
MLST type. 

These data were used to develop a novel 19-target binary 
typing system for MRSA [26]. 

Selection of targets for Streptococcus pneumoniae strain 
typing 

Results of MLVA typing, using 17 loci, for 1449 Strep- 
tococcus pneumoniae isolates (representing 906 possible 
MLVA types) were available from the MLVA online data- 
base (www.mlva.eu) [19] for analysis by AuSeTTS. A 
maximum D of 0.997 (0.997-0.998) was achieved with all 
17 loci but only 4 targets were required to achieve a D 
value of 0.990 (0.988-0.991), which divided the isolates 
into 438 MLVA types. 

A subset of the isolates for which MLVA results were 
available had also been serotyped (537 isolates representing 
43 serotypes and 398 MLVA types), and these were 
used to determine the combination of MLVA loci which 
could best predict the serotype. A maximum AW of 
0.899 (0.857-0.942) for serotype was achieved using 12 
of the MLVA loci. This particular combination of 12 tar- 
gets divided the dataset into 370 MLVA types, 352 of 



which contained only one serotype, while 15 contained 
two, two contained one and one MLVA type represented 
by 6 isolates harboured 5 different serotypes. 

A similar analysis was performed with MLST data 
which were available for 96 of the isolates consisting of 
27 sequence types (ST) and 77 possible MLVA types. A 
maximum AW of 0.963 (0.943-0.983) for MLVA to pre- 
dict ST was achieved with 9 targets which divided the 96 
isolates into 60 MLVA types. One MLVA type consisted 
of 3 isolates with 3 different MLST types. All other 
MLVA types consisted of isolates with matching MLST 
types. 

Selection of targets for Cryptococcus species strain typing 

Twelve MLST loci for 98 Cryptococcus spp. isolates from 
a previously published study [21] were examined using 
AuSeTTS. Eight of the 12 MLST loci provided a maximum 
D of 0.963 (0.945-0.981) for Cryptococcus spp. in 
a heuristic search. The exhaustive search, specifying a 
subset size of seven loci, indicated the same maximal D 
value could be achieved with only seven loci; i.e. for this 
dataset, the heuristic was invalid but the most inform- 
ative combination of targets could still be identified 
using an exhaustive search. This analysis was used, in 
part, to determine the recommended targets for an in- 
ternational consensus protocol for MLST typing of Cryp- 
tococcus spp. [27]. 

Discussion 

AuSeTTS has been successfully applied to develop typ- 
ing schemes for MRSA [26] and Cryptococcus spp. [27] 
and would be useful to assess the discriminatory power 
of combinations of candidate targets for typing systems 
for other pathogens. It can be used for a wide range of 
data types, but for interrogation of informative SNPs, we 
recommend Minimum SNPs, which has been designed 
specifically for this purpose [6,7]. Minimum SNPs should 
be used to examine input data in the form of multiple se- 
quence alignments. AuSeTTS can also be used to examine 
the level of concordance between results produced using 
subsets of candidate targets and those of existing phe- 
notyping or genotyping methods or with epidemiologic 
classifications. Minimum SNPs does provide some func- 
tionality with regard to concordance measures (the "not- 
N" mode), but does not calculate the Wallace or Rand 
coefficients or confidence intervals for the adjusted 
Wallace coefficient. 

While the algorithm used in the heuristic search may 
not always provide a definitive result for the minimum 
subset size required for the maximal D value, it will be 
correct in the majority of cases. For smaller datasets, 
an exhaustive search can easily be undertaken to confirm 
the validity of the heuristic. This is particularly 
recommended if the dataset has several features that 
were associated with a higher likelihood of the heuristic 
being invalid, such as low maximum D values, a threshold 
value close to 50% of the total number of targets, 
fewer than 8 states per target and a 
large number of unique strain types. A worked example 
demonstrating the use of AuSeTTS (Additional file 3), 
based on a sample dataset (Additional file 4), accompanies 
this paper. 

Conclusions 

Computerised analysis with AuSeTTS enables rapid, au- 
tomated identification of the most informative targets 
for incorporation into novel molecular typing schemes 
for bacteria and fungi. Discriminatory power and con- 
cordance, while important, are only two of the many pa- 
rameters that need to be considered when developing a 
new molecular typing technique. Reproducibility, sta- 
bility, ease of use, ease of interpretation, throughput and 
cost are additional measures that require thorough 
assessment and comparison with existing methods du- 
ring development and evaluation of novel typing tech- 
niques [5]. 

Availability and requirements 

Project name: AuSeTTS 

Project home page: http://www.cidmpublichealth.org/ 
pages/ausetts.html 

Operating system(s): Microsoft Windows 
Programming language: Visual Basic for Applications 
Other requirements: Microsoft Excel for Windows 
License: Unrestricted Freeware 

Additional files 



Additional file 1: The AuSeTTS software file. 

Additional file 2: The full description of the heuristic search 
validation. 

Additional file 3: A worked example using the dataset in Additional 
file 4. 

Additional file 4: Sample AuSeTTS dataset. 